Specifying epitopes

Polyclonal sera contains antibodies that can target multiple distinct epitopes on a viral protein. While we do not know the exact composition of this sera and which epitopes are targeted, Polyclonal requires a user to specify the number of epitopes prior to fitting.

Here, we’ll use simulated data to show how incorrectly guessing the number of true epitopes affects the performance of Polyclonal models. We’ll also use this to provide guidance on how to systematically determine the optimal number of epitopes.

[1]:
import os
import pickle

import pandas as pd
import polyclonal

First, we read in a simulated “noisy” dataset containing 30,000 variants measured at three different sera concentrations. The variants in this library were simulated to contain a Poisson-distributed number of mutations, with an average of three mutations per gene.

[2]:
noisy_data = (
    pd.read_csv('RBD_variants_escape_noisy.csv', na_filter=None)
    .query('library == "avg3muts"')
    .query('concentration in [0.25, 1, 4]')
    .reset_index(drop=True)
    )
noisy_data
[2]:
library aa_substitutions concentration prob_escape IC90
0 avg3muts 0.25 0.00000 0.1128
1 avg3muts 0.25 0.01090 0.1128
2 avg3muts 0.25 0.01458 0.1128
3 avg3muts 0.25 0.09465 0.1128
4 avg3muts 0.25 0.03299 0.1128
... ... ... ... ... ...
89995 avg3muts Y449I L518Y C525R L461I 4.00 0.02197 2.3100
89996 avg3muts Y449V K529R N394R 4.00 0.04925 0.9473
89997 avg3muts Y451L N481T F490V 4.00 0.02315 0.9301
89998 avg3muts Y453R V483G L492V N501P I332P 4.00 0.00000 5.0120
89999 avg3muts Y489Q N501Y 4.00 0.00000 0.5881

90000 rows × 5 columns

We can also visualize the “true” mutation-escape values, \(\beta_{m,e}\), and wildtype activity values, \(a_{wt,e}\) as a reference for later.

[3]:
activity_wt_df = pd.read_csv('RBD_activity_wt_df.csv')
mut_escape_df = pd.read_csv('RBD_mut_escape_df.csv')

poly_abs = polyclonal.Polyclonal(
                activity_wt_df=activity_wt_df,
                mut_escape_df=mut_escape_df)
[4]:
poly_abs.activity_wt_barplot()
[4]:
[5]:
poly_abs.mut_escape_lineplot()
[5]:

Additionally, we’ll make a directory for storing our fit models as pickle files, so that we can conveniently load them in the future without having to fit again.

[6]:
os.makedirs('fit_polyclonal_models', exist_ok=True)

Specifying 3 epitopes

We’ll start by correctly initializing a Polyclonal model with three epitopes and fitting to the data.

Without seeding key sites

First, we’ll try fitting to the data without using any prior knowledge.

[7]:
# The key for the fit model
model_string = 'noisy_[0.25, 1, 4]conc_3muts_3epitopes_noinit'

# If the pickled model exists in the fit_polyclonal_models directory, load it.
if os.path.exists(f'fit_polyclonal_models/{model_string}.pkl') is True:
    model = pickle.load(open(f'fit_polyclonal_models/{model_string}.pkl', 'rb'))
    print(f"Model with 3 epitopes specified was already fit.")
else:
    # Else, fit a model and save it to the fit_polyclonal_models directory.
    model = polyclonal.Polyclonal(data_to_fit=noisy_data,
                                  n_epitopes=3,
                                  )
    opt_res = model.fit(logfreq=500)
    pickle.dump(model, open(f'fit_polyclonal_models/{model_string}.pkl', 'wb'))
Model with 3 epitopes specified was already fit.
[8]:
model.activity_wt_barplot()
[8]:
[9]:
# NBVAL_IGNORE_OUTPUT
model.mut_escape_lineplot()
[9]:

With seeding key sites

Next, we’ll add prior knowledge. We know from prior work the three most important epitopes and a key mutation in each, so we use this prior knowledge to “seed” initial guesses that assign large escape values to a key site in each epitope:

  • site 417 for class 1 epitope, which is often the least important

  • site 484 for class 2 epitope, which is often the dominant one

  • site 444 for class 3 epitope, which is often the second most dominant one

[10]:
# The key for the fit model
model_string = 'noisy_[0.25, 1, 4]conc_3muts_3epitopes'

# If the pickled model exists in the fit_polyclonal_models directory, load it.
if os.path.exists(f'fit_polyclonal_models/{model_string}.pkl') is True:
    model = pickle.load(open(f'fit_polyclonal_models/{model_string}.pkl', 'rb'))
    print(f"Model with 3 epitopes specified was already fit.")
else:
    # Else, fit a model and save it to the fit_polyclonal_models directory.
    model = polyclonal.Polyclonal(data_to_fit=noisy_data,
                                  activity_wt_df=pd.DataFrame.from_records(
                                         [('1', 1.0),
                                          ('2', 3.0),
                                          ('3', 2.0),
                                          ],
                                         columns=['epitope', 'activity'],
                                         ),
                                     site_escape_df=pd.DataFrame.from_records(
                                         [('1', 417, 10.0),
                                          ('2', 484, 10.0),
                                          ('3', 444, 10.0),
                                          ],
                                         columns=['epitope', 'site', 'escape'],
                                         ),
                                     data_mut_escape_overlap='fill_to_data',
                                  )
    opt_res = model.fit(logfreq=500)
    pickle.dump(model, open(f'fit_polyclonal_models/{model_string}.pkl', 'wb'))
Model with 3 epitopes specified was already fit.
[11]:
model.activity_wt_barplot()
[11]:
[12]:
# NBVAL_IGNORE_OUTPUT
model.mut_escape_lineplot()
[12]:

With our without prior knowledge, the mutation escape values, \(\beta_{m,e}\), and wildtype activity values, \(a_{wt,e}\), inferred by the model strongly match the “true” values, as expected. However despite this, note one subtle difference (that is a consistent theme below) is that the model without prior knowledge calls site 507 as a more prominent site of escape in the class 1 epitope than the class 3 epitope, which is not true.

Specifying 2 epitopes

Now, we’ll try initializing a Polyclonal model with 2 epitopes and fitting to the data instead.

Without seeding key sites

First, we’ll try fitting to the data without using any prior knowledge.

[13]:
# The key for the fit model
model_string = 'noisy_[0.25, 1, 4]conc_3muts_2epitopes_noinit'

# If the pickled model exists in the fit_polyclonal_models directory, load it.
if os.path.exists(f'fit_polyclonal_models/{model_string}.pkl') is True:
    model = pickle.load(open(f'fit_polyclonal_models/{model_string}.pkl', 'rb'))
    print(f"Model with 2 epitopes specified was already fit.")
else:
    # Else, fit a model and save it to the fit_polyclonal_models directory.
    model = polyclonal.Polyclonal(data_to_fit=noisy_data,
                                  n_epitopes=2,
                                  )
    opt_res = model.fit(logfreq=500)
    pickle.dump(model, open(f'fit_polyclonal_models/{model_string}.pkl', 'wb'))
Model with 2 epitopes specified was already fit.
[14]:
model.activity_wt_barplot()
[14]:
[15]:
model.mut_escape_lineplot()
[15]:

With seeding key sites

Next, we’ll add prior knowledge. We’ll ignore the most subdominant class 1 epitope to simulate a scenario where we did not have any prior knowledge of its mutations having an effect on escape.

[16]:
# The key for the fit model
model_string = 'noisy_[0.25, 1, 4]conc_3muts_2epitopes'

# If the pickled model exists in the fit_polyclonal_models directory, load it.
if os.path.exists(f'fit_polyclonal_models/{model_string}.pkl') is True:
    model = pickle.load(open(f'fit_polyclonal_models/{model_string}.pkl', 'rb'))
    print(f"Model with 2 epitopes specified was already fit.")
else:
    # Else, fit a model and save it to the fit_polyclonal_models directory.
    model = polyclonal.Polyclonal(data_to_fit=noisy_data,
                                  activity_wt_df=pd.DataFrame.from_records(
                                         [('2', 2.0),
                                          ('3', 1.0),
                                          ],
                                         columns=['epitope', 'activity'],
                                         ),
                                     site_escape_df=pd.DataFrame.from_records(
                                         [('2', 484, 10.0),
                                          ('3', 444, 10.0),
                                          ],
                                         columns=['epitope', 'site', 'escape'],
                                         ),
                                     data_mut_escape_overlap='fill_to_data',
                                  )
    opt_res = model.fit(logfreq=500)
    pickle.dump(model, open(f'fit_polyclonal_models/{model_string}.pkl', 'wb'))
Model with 2 epitopes specified was already fit.
[17]:
model.activity_wt_barplot()
[17]:
[18]:
model.mut_escape_lineplot()
[18]:

We observe that this model identifies the correct escape mutations at epitopes 2 and 3, the escape mutations in the class 1 epitope are not present, and interestingly the wildtype activity values, \(a_{wt,e}\), for epitopes 2 and 3 are slightly more positive than their “true” values. Again, we also observe strong concordance between the models fit with and without prior knowledge, although the epitopes profiles are flipped since the model fit without prior knowledge has arbitrary epitope labels.

4 epitopes

Lastly, we’ll try initializing a Polyclonal model with 4 epitopes and fitting to the data.

Without seeding key sites

First, we’ll try fitting to the data without using any prior knowledge.

[19]:
# The key for the fit model
model_string = 'noisy_[0.25, 1, 4]conc_3muts_4epitopes_noinit'

# If the pickled model exists in the fit_polyclonal_models directory, load it.
if os.path.exists(f'fit_polyclonal_models/{model_string}.pkl') is True:
    model = pickle.load(open(f'fit_polyclonal_models/{model_string}.pkl', 'rb'))
    print(f"Model with 4 epitopes specified was already fit.")
else:
    # Else, fit a model and save it to the fit_polyclonal_models directory.
    model = polyclonal.Polyclonal(data_to_fit=noisy_data,
                                  n_epitopes=4,
                                  )
    opt_res = model.fit(logfreq=500)
    pickle.dump(model, open(f'fit_polyclonal_models/{model_string}.pkl', 'wb'))
Model with 4 epitopes specified was already fit.
[20]:
model.activity_wt_barplot()
[20]:
[21]:
model.mut_escape_lineplot()
[21]:

With seeding key sites (and a site not in an epitope)

We’ll now try adding prior knowledge. But, what happens when we guess a key site that is not actually in an epitope?

[22]:
# The key for the fit model
model_string = 'noisy_[0.25, 1, 4]conc_3muts_4epitopes'

# If the pickled model exists in the fit_polyclonal_models directory, load it.
if os.path.exists(f'fit_polyclonal_models/{model_string}.pkl') is True:
    model = pickle.load(open(f'fit_polyclonal_models/{model_string}.pkl', 'rb'))
    print(f"Model with 4 epitopes specified was already fit.")
else:
    # Else, fit a model and save it to the fit_polyclonal_models directory.
    model = polyclonal.Polyclonal(data_to_fit=noisy_data,
                                  activity_wt_df=pd.DataFrame.from_records(
                                         [('1', 2.0),
                                          ('2', 4.0),
                                          ('3', 3.0),
                                          ('4', 1.0),
                                          ],
                                         columns=['epitope', 'activity'],
                                         ),
                                     site_escape_df=pd.DataFrame.from_records(
                                         [('1', 417, 10.0),
                                          ('2', 484, 10.0),
                                          ('3', 444, 10.0),
                                          ('4', 386, 10.0),
                                          ],
                                         columns=['epitope', 'site', 'escape'],
                                         ),
                                     data_mut_escape_overlap='fill_to_data',
                                  )
    opt_res = model.fit(logfreq=500)
    pickle.dump(model, open(f'fit_polyclonal_models/{model_string}.pkl', 'wb'))
Model with 4 epitopes specified was already fit.
[23]:
model.activity_wt_barplot()
[23]:
[24]:
model.mut_escape_lineplot()
[24]:

For class 1, 2, and 3 epitopes, the mutation escape values, \(\beta_{m,e}\), and wildtype activity values, \(a_{wt,e}\), inferred by the model strongly match the “true” values. The class 4 epitope, which is redundant, contained no clear escape mutations and had a strongly negative wildtype activity value, \(a_{wt,e}\), suggesting that antibodies targeting that “epitope” don’t make an appreciable contribution to the observed data. However, we do see our initial guess has had a slight effect, as the class 4 epitope \(\beta_{m,e}\)’s are noisier than the class 4 epitope \(\beta_{m,e}\)’s of the model fit with no prior knowledge.

With seeding key sites (and two sites in the same epitope)

Lastly, we’ll keep our initial guesses for the first 3 epitopes, but try guessing an additional key site that is already located in an epitope that we’ve specified. Specifically, we’ll seed it with site 460, which is in the class 1 epitope and has many escape mutations.

[25]:
# The key for the fit model
model_string = 'noisy_[0.25, 1, 4]conc_3muts_4epitopes_2'

# If the pickled model exists in the fit_polyclonal_models directory, load it.
if os.path.exists(f'fit_polyclonal_models/{model_string}.pkl') is True:
    model = pickle.load(open(f'fit_polyclonal_models/{model_string}.pkl', 'rb'))
    print(f"Model with 4 epitopes specified was already fit.")
else:
    # Else, fit a model and save it to the fit_polyclonal_models directory.
    model = polyclonal.Polyclonal(data_to_fit=noisy_data,
                                  activity_wt_df=pd.DataFrame.from_records(
                                         [('1', 2.0),
                                          ('2', 4.0),
                                          ('3', 3.0),
                                          ('4', 1.0),
                                          ],
                                         columns=['epitope', 'activity'],
                                         ),
                                     site_escape_df=pd.DataFrame.from_records(
                                         [('1', 417, 10.0),
                                          ('2', 484, 10.0),
                                          ('3', 444, 10.0),
                                          ('4', 460, 10.0),
                                          ],
                                         columns=['epitope', 'site', 'escape'],
                                         ),
                                     data_mut_escape_overlap='fill_to_data',
                                  )
    opt_res = model.fit(logfreq=500)
    pickle.dump(model, open(f'fit_polyclonal_models/{model_string}.pkl', 'wb'))
Model with 4 epitopes specified was already fit.
[26]:
model.activity_wt_barplot()
[26]:
[27]:
model.mut_escape_lineplot()
[27]:

We see a similar result to what we got from the model fit when we guessed a key site that was not actually an epitope. Although the class 4 epitope was seeded with a different key mutation in the class 1 epitope, it still contained no clear escape mutations.

Summary

These simulation experiments provide a general guideline for specifying the number of epitopes. When fitting Polyclonal models, one can start with 1 epitope and iteratively fit models with increasing number of epitopes. At some point, the newly seeded \(N\)-th epitope will become redundant, as evidenced by a profile of near-zero mutation escape values, \(\beta_{m,e}\), and a strongly negative wildtype activity value, \(a_{wt,e}\). This is indication to the user that the previous fit model, containing \(N - 1\) epitopes, is the one that best captures the data and polyclonal mix.